78 - HPC Café on January 14, 2025: The Power of Workflow Systems on HPC Clusters - An Introduction to Snakemake [ID:55987]

Thank you.


Yeah, well, it's quite crowded; counting also the online participants, there are more of you than I anticipated.

Welcome.

Yeah, I want to talk about the importance of workflow engines in contemporary data analysis, and I adapted the title for admins because I anticipated that some of my fellow administrators and service providers would be with us, alongside HPC users and data analysts. I'm from NHR South-West in Rhineland-Palatinate, as already mentioned, and I also represent the Snakemake Teaching Alliance, an alliance of fellow developers.

All right, I want to introduce you to the reasons why we ought to use workflow managers, and not necessarily Snakemake alone. There are, of course, various alternatives.

And then I'll introduce Snakemake and its capabilities: how we handle software, and what's in there for administrators and users. I'll show you how this can be launched when we get to work, and a little bit about the details of workflow parameterization and how we can get from a generic workflow, which runs on a desktop or a laptop, onto an HPC system.

So the first part: I think when centres teach HPC 101, they teach their batch systems to their users, and I want to put a question mark behind that. Also, I want to illustrate the benefits of a workflow system for administrators, and the difference between a workflow system and a pipeline, and thereby introduce a little bit of Snakemake.

Now, data analysis can be quite easy and can be conducted on a simple computer. However, in the light of rather big amounts of data, this can be quite frustrating. And just learning Slurm might not be the best and wisest move, and I'll illustrate here why.

But first, you have to understand that when we talk about data analysis, there are constituents which, I think (and if you beg to differ, just interrupt me), are always or almost always present. That is quality control, then processing, presumably with multiple processing steps. Then you do some summary statistics, and then you want to publish; therefore, you need to visualize, to plot something statically or interactively. Not all such steps, if we are honest, are HPC-worthy.

Because, for instance, if you do an internet download, or move around and tinker with data to get it into your program, that's something you can do on any computer. That's not what you do on an HPC system, not necessarily. But I want to argue that we can actually do that there without great harm.
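As a minimal sketch of how a workflow system lets you keep such steps off the compute nodes (the rule names, paths, and download URL below are illustrative, not from the talk), Snakemake can mark individual rules as local, so they run where Snakemake itself runs rather than being submitted as cluster jobs:

```
# Hypothetical Snakefile fragment: the non-HPC-worthy step is marked local.
localrules: download          # runs on the login node / laptop, not as a cluster job

rule download:
    output:
        "data/raw/{sample}.fastq.gz"
    shell:
        "wget -O {output} https://example.org/{wildcards.sample}.fastq.gz"

rule process:
    input:
        "data/raw/{sample}.fastq.gz"
    output:
        "results/{sample}.processed.txt"
    resources:
        mem_mb=8000, runtime=60   # resource requests matter once this rule becomes a cluster job
    shell:
        "process_tool {input} > {output}"   # process_tool is a placeholder
```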

For that, I want to introduce you to a DAG. I know that many of you are computer scientists, so you know this: a DAG is a directed acyclic graph, and that's an entity with which we can represent any data analysis. While I talk, you will always see these little boxes.

And you do not need to understand every detail; here is a simple workflow. For instance, in linguistics, we can count words, make a plot of this, put this to a statistical test, and archive the results. These are not necessarily HPC jobs, but jobs: something which is carried out by a workflow manager, or by the scientist doing this kind of analysis by hand. A full DAG also takes into account that you presumably have multiple samples. Here, in this little linguistics example where we count words in books, we might have several books; then we run the counting several times and subject the counts to plotting, one statistical test, and one archiving step.
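To make this concrete, here is a hedged sketch of how such a word-counting workflow might look as a Snakefile; the book names, script names, and file layout are illustrative, not taken from the talk:

```
# Hypothetical Snakefile for the word-counting example.
BOOKS = ["dracula", "frankenstein", "moby_dick"]

rule all:
    input:
        "results/counts.tar.gz",
        expand("plots/{book}.png", book=BOOKS),
        "results/test_result.txt"

rule count_words:
    input:
        "books/{book}.txt"
    output:
        "counts/{book}.tsv"
    shell:
        "python scripts/count_words.py {input} > {output}"

rule plot:
    input:
        "counts/{book}.tsv"
    output:
        "plots/{book}.png"
    shell:
        "python scripts/plot_counts.py {input} {output}"

rule test:
    input:
        expand("counts/{book}.tsv", book=BOOKS)
    output:
        "results/test_result.txt"
    shell:
        "python scripts/statistical_test.py {input} > {output}"

rule archive:
    input:
        expand("counts/{book}.tsv", book=BOOKS)
    output:
        "results/counts.tar.gz"
    shell:
        "tar czf {output} {input}"
```

Each rule corresponds to one box in the DAG, and the {book} wildcard is what multiplies the counting and plotting jobs over the samples.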

And it can get extremely complex, as I'll show you. If there are any questions, don't hesitate to interrupt. So this is what you need to have as background: basically, every such box, whether you can read it or not does not matter, stands for a job. So I want to give you an example. This here is a workflow. What it does is simply some proteotranscriptomics: here we do some mass spec data analysis, and we combine this with a genome assembly and annotation. That's something which we can conduct on an HPC system, and it's quite HPC-worthy because it's extremely compute-intensive. Hence, what does it take to run this on an HPC cluster? And I can be proud of myself, because I've been teaching all these intro courses for quite a while, and one of those students got to work and implemented this very workflow with bash and Slurm commands alone. I'll show you what it looks like. So these are all the files. Here we go.
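To give a rough impression of what such a pure bash-plus-Slurm implementation involves, here is a hypothetical sketch (not the student's actual files): every step gets its own sbatch script, and the DAG's dependencies have to be wired up by hand with --dependency flags.

```
#!/bin/bash
# Hypothetical driver script chaining Slurm jobs by hand;
# the job scripts and sample names are placeholders.
set -euo pipefail

samples="sample1 sample2 sample3"
assembly_deps=""

for s in $samples; do
    qc_id=$(sbatch --parsable --job-name="qc_${s}" qc.sbatch "$s")
    proc_id=$(sbatch --parsable --dependency="afterok:${qc_id}" \
                     --job-name="proc_${s}" process.sbatch "$s")
    assembly_deps+=":${proc_id}"
done

# The assembly/annotation steps may only start after all processing jobs succeed.
asm_id=$(sbatch --parsable --dependency="afterok${assembly_deps}" assembly.sbatch)
sbatch --dependency="afterok:${asm_id}" annotate.sbatch
```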

Part of a video series / chapter: HPC Café

Accessible via: Open access

Duration: 00:53:07 min

Recording date: 2025-01-17

Uploaded on: 2025-01-17 18:46:03

Language: en-US

Speaker: Dr. Christian Meesters, Johannes Gutenberg University Mainz

Date: January 14, 2025

Slides

Abstract:

This talk highlights the benefits of using workflow management systems, with a focus on Snakemake, for multistep data analysis on high-performance computing (HPC) clusters. It shows how workflows can streamline research by automating tasks, managing software environments (e.g., Conda, containers, module files), and handling HPC-specific requirements like resource allocation and job submission. We introduce the Snakemake workflow catalog, a resource for prebuilt workflows to save time and avoid reinventing the wheel. Parameterization enables workflow flexibility and scalability. Finally, the talk will explore how Snakemake facilitates reproducibility, from deployment to comprehensive workflow reports with execution statistics and publication-ready outputs. 
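As a rough illustration of the launch step the abstract refers to, the following is a minimal sketch assuming Snakemake >= 8 with the snakemake-executor-plugin-slurm installed; job counts and resource values are placeholders:

```
# Submit the workflow's jobs to Slurm, deploying software via Conda:
snakemake --executor slurm --jobs 100 \
          --software-deployment-method conda \
          --default-resources mem_mb=2000 runtime=30

# After a successful run, produce a self-contained report with execution statistics:
snakemake --report report.html
```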

Material from past events is available at: https://hpc.fau.de/teaching/hpc-cafe/
